[Enhancement] Support bare-field join criteria join on <field> shorthand#5517
[Enhancement] Support bare-field join criteria join on <field> shorthand#5517Hanyu-W wants to merge 1 commit into
join on <field> shorthand#5517Conversation
PR Reviewer Guide 🔍(Review updated until commit 19e6693)Here are some key observations to aid the review process:
|
PR Code Suggestions ✨Latest suggestions up to 19e6693
Previous suggestionsSuggestions up to commit 374d412
|
join on <field> shorthandjoin on <field> shorthand
| new Field(QualifiedName.of(rightQual, f)))) | ||
| .reduce((l, r) -> new And(l, r)) | ||
| .orElseThrow( | ||
| () -> new SemanticCheckException("join 'on' shorthand requires at least one field")); |
There was a problem hiding this comment.
is this the only error case that can throw here?
There was a problem hiding this comment.
There were two (this one, and an empty-list guard above it). Both are removed in the redesign, so neither throws anymore.
| private static String qualifierOf(UnresolvedPlan plan, String side) { | ||
| Relation relation = findRelation(plan); | ||
| if (relation == null) { | ||
| throw new SemanticCheckException( |
There was a problem hiding this comment.
nice detail on the error message!
consider using the ReportBuilder to split out suggestions from the problem
There was a problem hiding this comment.
That error path is going away — once matching is done by position in the planner, there's no "must be a named table" restriction, so the exception it was attached to is deleted.
| if (rightAlias.isEmpty() && this.right instanceof SubqueryAlias alias) { | ||
| rightAlias = Optional.of(alias.getAlias()); | ||
| } | ||
| rewriteBareFieldCriteria(); |
There was a problem hiding this comment.
Is putting it here the right boundary over resolving these fields in the constructor (or even frontloading it to the AST builder)?
Putting the rewrite here introduces a weird boundary, where now Join plays the role of dynamically parsing itself during attachment based directly on AST grammar choices. That multiple places comment "we actually resolve these fields in attach()" also hints that this is unintuitive.
It might be clearer to put this in the constructor. Alternatively, it's a little more work, but we could consider moving all of leftAlias, rightAlias, joinFields and similar into a dedicated JoinClause data type that supports operations like clause.expandBareFields() and clause.resolveAliases()
There was a problem hiding this comment.
Yes — that's the fix I'm making. I'm going to move the expansion into visitJoin instead of attach(), so the AST keeps query how wrote it and the analysis happens in the planner where it belongs. Explain output will now match the original query too.
Yeah, I'll make the change Ryan suggested to fix this.
| | `source=table1 \| join right=tt on table1.id=t2.id [ source=table2 as t2 \| eval b = id ] \| eval a = 1` | `table1.id, tt.id, tt.b, a` | | ||
|
|
||
| * **Field deduplication in extended syntax** – When using the extended syntax with a field list, duplicate field names in the output are deduplicated according to the `overwrite` option. | ||
| * **Field deduplication in extended syntax** – When using the extended syntax with a field list (`<join-field-list>`), duplicate field names are deduplicated according to the `overwrite` option. The bare-field `on`/`where` shorthand keeps both key columns, unlike the field-list form. |
There was a problem hiding this comment.
the query grammar at the top of the doc wasn't updated to show what join-field-list is
There was a problem hiding this comment.
Good catch! I'll cut it because it isn't adding anything valuable to the docs.
| if (rightAlias.isEmpty() && this.right instanceof SubqueryAlias alias) { | ||
| rightAlias = Optional.of(alias.getAlias()); | ||
| } | ||
| rewriteBareFieldCriteria(); |
There was a problem hiding this comment.
Performing semantic rewriting inside the AST node's attach() mixes tree construction with analysis, which usually lives in the analyzer layer here. The unresolved tree no longer reflects what the user wrote, which can make explain output and debugging confusing. Was an analyzer pass considered?
There was a problem hiding this comment.
Yes — that's the fix I'm making. I'm going to move the expansion into visitJoin instead of attach(), so the AST
keeps query how wrote it and the analysis happens in the planner where it belongs. Explain output will now match the original query too.
Let `on`/`where` join criteria accept a bare field name (or `AND` of bare fields) as shorthand for the qualified equality on that common field: source=t1 | inner join on a t2 == on t1.a = t2.a source=t1 | inner join on a AND b t2 == on t1.a = t2.a AND t1.b = t2.b source=t1 | join on a t2 == on t1.a = t2.a (no prefix/alias) The shorthand behaves exactly like the explicit `on l.f = r.f` criteria: it KEEPS BOTH key columns (the right key renamed `<rightAlias>.f`, or `<rightTable>.f` when no alias is given). This differs from the field-list join (`join f t2`), which merges the duplicate key to a single column. Implementation: - Grammar: reorder the joinCommand alternatives so the criteria alternative is listed first. This lets the no-prefix form `join on a <right>` parse `on` as the criteria keyword instead of the field-list alternative greedily consuming `on`/`where` as a field name. ANTLR ALL(*) picks the first-listed alternative on genuine ambiguity. - AstBuilder: a bare single-part field (or AND-chain of them) is left as the join condition verbatim -- the unresolved AST reflects what the user wrote. - Planner (CalciteRelNodeVisitor.visitJoin): the criteria branch detects a bare-field condition and builds the equi-join from it, resolving each field by stack position (LEFT.f = RIGHT.f) via the existing buildJoinConditionByFieldName helper. No new planner code, and no AST mutation. Because resolution is positional rather than by qualifier, a self-join `join on f t` is a genuine cross-scan equi-join rather than the `f = f` tautology a name-based rewrite would collapse to. Adds AstBuilder, Calcite plan, anonymizer, and integration tests; updates the join command user manual. Signed-off-by: Hanyu Wei <weihanyu@amazon.com>
374d412 to
19e6693
Compare
|
Persistent review updated to latest commit 19e6693 |
|
📝 Implementation note — the description above predates the redesign. Heads up for anyone reading top-to-bottom: after the review feedback, I moved the bare-field expansion out of the AST ( Design (as committed):
Files actually changed (
Tests (re-run locally just now, all green):
|
| * Returns false for anything else (qualified field, comparison, OR); {@code out} is only valid | ||
| * when this returns true, so callers must check the return before reading it. | ||
| */ | ||
| static boolean collectBareFields(UnresolvedExpression expr, List<String> out) { |
There was a problem hiding this comment.
The out list is mutated as a side effect and is only valid when the method returns true — the Javadoc says so, which is good. But this is the classic boolean-return-plus-out-param pattern that's easy to misuse: a caller that reads out after a false return gets a partially-populated list (the recursion adds left-side fields before the right side returns false).
There was a problem hiding this comment.
That's a good point. I refactored the code to Optional<List>, so therefore the list is only assembled when every node in the AND-tree is a bare field, so partial state can't exist.
I also added JoinAndLookupUtilsTest with 5 cases including mixedLeftSideReturnsEmptyNoPartialState, confirming that And(Compare(...), Field("b")) returns Optional.empty(), not a list containing "b".
| node.getJoinCondition() | ||
| .map(c -> rexVisitor.analyzeJoinCondition(c, context)) | ||
| .orElse(context.relBuilder.literal(true)); | ||
| RexNode joinCondition; |
There was a problem hiding this comment.
What happens with a mixed condition like on a AND b.x = c.y ?
There was a problem hiding this comment.
I tested this, collectBareFields returns Optional.empty(), since the Compare node isn't a bare field), so the whole expression falls through to analyzeJoinCondition unchanged. I added two tests (testJoinMixedConditionFallsThrough and testJoinMixedConditionWithInequality) to prove it produces the correct plan via the normal path.
| : JOIN (joinOption)* (fieldList)? right = tableOrSubqueryClause | ||
| | sqlLikeJoinType? JOIN (joinOption)* sideAlias joinHintList? joinCriteria right = tableOrSubqueryClause | ||
| // Criteria alt listed first - so `join on a` reads `on` as the criteria keyword, not a field. | ||
| : sqlLikeJoinType? JOIN (joinOption)* sideAlias joinHintList? joinCriteria right = tableOrSubqueryClause |
There was a problem hiding this comment.
Same as before: the alternative reorder affects every join query's parse, not just the shorthand. testJoinFieldListStillParsesAsFieldList and testJoinPrefixWithoutCriteriaKeywordIsSyntaxError are good guards.
Please still call the grammar precedence change out explicitly in the PR description so reviewers don't miss it — the new comment covers the why well, but a note that existing queries were re-verified would close it out.
There was a problem hiding this comment.
Will do-adding a note to the PR description that the reorder affects all join queries' parse path, and that existing behavior is verified by testJoinFieldListStillParsesAsFieldList + testJoinPrefixWithoutCriteriaKeywordIsSyntaxError (158 AstBuilder + 55 CalcitePPLJoin tests).
Description
Lets
joinon/wherecriteria accept a bare field name (or anANDchain of bare fields) as shorthand for the qualified equality on that common field. Before this change, the only way to join on a shared column was to spell out both qualifiers:The shorthand behaves exactly like the explicit
on l.f = r.fit sugars: it keeps both key columns (the right key surfaced as<rightAlias>.f, or<rightTable>.fwhen no alias is given). This is intentionally different from the field-list joinjoin f t2, which merges the duplicate key into a single column.Two things made this more than a one-line grammar tweak, and shaped the design:
onis a keyword that can also be an identifier, so the no-prefix formjoin on a t2was greedily consumed by the field-list alternative —onparsed as a field name, failing withfield [on] not found. Reordering thejoinCommandalternatives so the criteria alternative is listed first makes ANTLR ALL(*) prefer readingonas the criteria keyword on genuine ambiguity, with zero change to the field-list form's behavior.on acarries no table qualifier, but the keep-both planner branch needsleft.a = right.a. The left table name isn't known at parse time — it's only known once both inputs are attached. So the rewrite is deferred toJoin.attach, where the qualifier is taken from the side alias (left=/right=/AS) when present, and otherwise from the underlying table name (walking the single-child plan chain down to theRelation). An un-aliased subquery side with no reachable table name throws aSemanticCheckExceptiontelling the user to add an alias.This keeps the planner 100% unchanged — the rewritten criteria is an ordinary
Compare("=", l.f, r.f)that flows through the existing keep-both join branch; all shorthand-specific logic lives in the AST node and grammar.Resolves #5484.
Changes
ppl/src/main/antlr/OpenSearchPPLParser.g4joinCommandso the criteria alternative precedes the field-list alternative, sojoin on areadsonas the criteria keyword instead of a fieldppl/.../ppl/parser/AstBuilder.javaANDchain) as the join condition rather than treating it as a field listcore/.../ast/tree/Join.javaattach()rewrites the bare-field condition into the qualified equalityl.f = r.fusing side aliases or table names;joinConditionmade non-final; newrewriteBareFieldCriteria/collectBareFields/buildQualifiedEqualityConjunction/qualifierOf/findRelationhelpers;SemanticCheckExceptionfor an unresolvable un-aliased sidedocs/user/ppl/cmd/join.md<joinCriteria>parameter descriptions, and the field-dedup limitation noteppl/.../calcite/CalcitePPLJoinTest.java,ppl/.../parser/AstBuilderTest.java,ppl/.../utils/PPLQueryDataAnonymizerTest.javainteg-test/.../calcite/remote/CalcitePPLJoinIT.javaTesting
CalcitePPLJoinITon a live cluster, exercising the new shorthand against pre-change behavior:testJoinWithImplicitFieldinner join on name— keep-both; asserts shorthand output is identical to expliciton l.name = r.name; co-named columns resolve to the lefttestJoinNoPrefixImplicitFieldjoin on name— no prefix;onparsed as the criteria keyword, identical to the explicit form (not the field-list merge)testLeftJoinWithImplicitFieldleft join on name— unmatched-left rows keep the left key (no NULL key; keep-both does not coalesce)CalcitePPLJoinIT(new) totalUnit suites (all green, run locally):
CalcitePPLJoinTest: 50/50 (10 new) — plan shape for single/multi-field, alias,where, no-prefix, left/semi, self-join tautology, and not-on-both-sides cases.AstBuilderTest: 158/158 (11 new) — rewrite-vs-passthrough decisions (bare field rewritten; qualified/comparison criteria untouched; field-list still parses as a field list; prefix-without-criteria is a syntax error).PPLQueryDataAnonymizerTest: 119/119 (1 new) — shorthand anonymizes correctly.Check List
--signoffor-s.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.